Abstract

Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data 
as well as for ranking candidate predictors based on the so-called random forest variable importance measures 
(VIMs). However, the classification performance of RF is known to be suboptimal in case of strongly unbalanced
data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification
performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge the
performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we
explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative 
permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class 
imbalance. 

Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based 
permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the 
new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while 
both permutation VIMs have equal performance for balanced data settings. 

Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and 
predictors not associated with the response for increasing class imbalance. It is outperformed by our new 
AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the 
case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF 
variant based on conditional inference trees. The code implementing our study is available from the companion
website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html. 

Keywords: Random forest, Conditional inference trees, Variable importance measure, Feature selection, Unbalanced 
data, Class imbalance, Area under the curve. 



Background 

In bioinformatics and related fields, such as statistical 
genomics and genetic epidemiology, data are often 
highly correlated, heterogeneous and high-dimensional, 
with the number of predictors, also known as features or 
descriptors, exceeding the number of observations. The 
random forest (RF) approach developed by Leo Breiman 
in 2001 [1] is particularly appropriate to handle such 
complex data [2]. In bioinformatics, RF is a commonly 
used tool for classification or regression purposes as well 
as for ranking candidate predictors through its inbuilt
variable importance measures (VIMs). It has been used in
many applications involving high-dimensional data. As a 
nonparametric method RF can deal with nonlinearity, 
interactions, correlated predictors and heterogeneity, 
which makes it attractive in genetic epidemiology [3-7]. 
However, in the context of classification, i.e. when the
response to be predicted is a class membership, the
classification performance of RF has been shown to be suboptimal
in case of strongly unbalanced data [8-10], i.e. when class
sizes differ considerably.

In epidemiology, unbalanced data are observed, e.g., in 
population-based studies where only a small number of 
subjects develop a certain disease over time, while most 
subjects remain healthy. Unbalanced data are also common
in screening studies, where most of the screened
persons are negative, as well as in subclass analyses, e.g.,
if one wants to differentiate between different subtypes 
of cancer. Usually some subclasses are more common 
than other subclasses leading to an imbalance in class 
sizes. Studies on rare diseases are a further example of 
unbalanced data settings in medicine. Data can be 
obtained only from few persons having the specific rare 
disease, while samples from healthy control persons are 
much easier to obtain. Of course unbalanced data are 
also relevant in various other areas of application beyond 
the biomedical field, e.g., the prediction of creditworthiness
of a bank's customers [11], the detection of fraudulent
telephone calls [12] or the detection of oil spills in satellite 
radar images [13], just to name a few examples. Unbal- 
anced data may arise whenever the class memberships are 
observed after data collection. 

Like many other classification methods RF produces 
classification rules that do not accurately predict the mi- 
nority class if data are unbalanced. The RF classifier allo- 
cates new observations more often to the majority class 
unless the difference between the classes is large and 
classes are well separable. For extreme class imbalances, 
e.g. if the minority class includes only 5% of the observa- 
tions, it might happen that the RF classifier allocates 
every observation to the majority class independently of 
the predictors, yielding a minimal error rate of 5%. 
Although this error rate of 5% is very small, such a trivial 
classification is of no practical use. 
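
The following minimal sketch illustrates this point with made-up labels (they are not data from this study): a classifier that allocates every observation to the majority class reaches an overall error rate of 5%, although it misclassifies every observation from the minority class. The function names are chosen for illustration only.

```python
def error_rate(y_true, y_pred):
    """Fraction of misclassified observations."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_error_rates(y_true, y_pred):
    """Error rate computed separately within each true class."""
    rates = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        rates[c] = sum(y_pred[i] != c for i in idx) / len(idx)
    return rates


# Hypothetical sample: 1000 observations, 5% in the minority class (coded 1).
y_true = [1] * 50 + [0] * 950
# Trivial classifier: every observation is allocated to the majority class.
y_pred = [0] * 1000

overall = error_rate(y_true, y_pred)            # 50 / 1000
per_class = class_error_rates(y_true, y_pred)   # error within class 0 and class 1
```

The overall error rate hides the fact that the error rate within the minority class is 100%.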

Some suggestions have been made to yield a useful 
classification based either on sampling procedures 
[14-17] or on cost sensitivity analyses [14]. Sampling 
procedures create an artificial balance between two or 
more classes by oversampling the minority class and/or 
downsampling the majority class. Cost sensitivity ana- 
lyses attribute a higher cost to the misclassification of an 
observation from the minority class to impede the trivial 
systematic classification to the larger class. Both aspects 
have been widely discussed in the literature with respect 
to RF's classification performance [14,15,18-21]. Recent 
simulation studies [9] have shown that the performance 
of RF classification for unbalanced data depends on (i) 
the imbalance ratio, (ii) the class overlap and (iii) the 
sample size. 
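
As an illustration of the sampling idea, the sketch below downsamples the majority class until both classes have the same size. It is a simplified example under our own assumptions (binary response coded 0/1, sampling without replacement) and does not reproduce any of the cited procedures.

```python
import random


def downsample_majority(y, rng):
    """Return sorted indices of a class-balanced subsample.

    All minority-class observations are kept; the majority class is
    randomly reduced to the size of the minority class.
    """
    idx0 = [i for i, c in enumerate(y) if c == 0]
    idx1 = [i for i, c in enumerate(y) if c == 1]
    minority, majority = sorted([idx0, idx1], key=len)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


# Hypothetical response vector with a 10% minority class.
y = [1] * 10 + [0] * 90
balanced_idx = downsample_majority(y, random.Random(1))
```

Oversampling the minority class would work analogously, drawing minority indices with replacement until the class sizes match.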

The impact of class imbalance on the RF VIM, how- 
ever, has to our knowledge not yet been examined in the 
literature. In this article we focus on the permutation 
VIM which is known to be almost unbiased and more 
reliable than the Gini VIM. The latter has been shown 
to have a preference for certain types of predictors 
[22-25] and therefore its rankings have to be treated 
with caution. We concentrate on the class imbalance 
problem for two response classes with respect to the 
permutation VIM. We investigate the mechanisms of 
changes in performance for unbalanced data settings
and motivate the use of a new permutation VIM which
is not based on the error rate but on the area under the 
curve (AUC). The AUC can be seen as an accuracy 
measure putting the same weight on both classes - in 
contrast to the error rate which essentially gives more 
weight to the majority class. As such, the AUC is a 
particularly appropriate prediction accuracy measure in 
unbalanced data settings [26]. A permutation VIM in 
which the error rate is replaced by the AUC is therefore 
a promising alternative to the standard permutation 
VIM. We performed extensive simulation studies to ex- 
plore and compare the behaviour of both permutation 
VIMs for different class imbalance levels, effect sizes 
and sample sizes. 

Methods 

The RF algorithm is a classification and regression 
method often used for high-dimensional data settings 
where the number of predictors exceeds the number of 
observations. Note that throughout this article we use 
the term predictors which is equivalent to features or 
descriptors denoting variables that are used to discrimin- 
ate the response classes. In the RF algorithm several 
individual decision trees are combined to make a final 
prediction. The final prediction is then the average (for 
regression) or the majority vote (for classification) of the 
predictions of all trees in the forest. Each tree is fitted to 
a random sample of observations (with or without 
replacement) from the original sample. Observations not 
used to construct a tree are termed out-of-bag (OOB) 
observations for that tree. For each split in each tree a 
randomly drawn subset of predictors is assessed as can- 
didates for splitting and the predictor yielding the best 
split is finally chosen for the split. In the original version 
of RF developed by Leo Breiman [1], the selected split is 
the split with the largest decrease in Gini impurity. In a 
later version of RF, conditional inference tests are used 
for selecting the best split in an unbiased way [27]. For 
each split in a tree, each candidate predictor from the 
randomly drawn subset is globally tested for its associ- 
ation with the response, yielding a global p-value. The 
predictor with the smallest p-value is selected, and 
within this globally selected predictor the best split is 
finally chosen for the split. 
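
The two sources of randomness described above can be sketched as follows. This is a schematic illustration, not the code of any RF implementation; the function names are hypothetical, and the subsampling fraction of 0.632 used in the example is an assumption made for illustration.

```python
import random


def draw_tree_sample(n, fraction, replace, rng):
    """Draw the observations used to fit one tree.

    Returns (in_bag, oob): the indices used for fitting and the
    out-of-bag indices that were not drawn.
    """
    if replace:
        # Bootstrap sample: n draws with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
    else:
        # Subsample: a fraction of the observations, without replacement.
        in_bag = rng.sample(range(n), int(fraction * n))
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def candidate_predictors(p, mtry, rng):
    """Randomly draw mtry of the p predictors as split candidates."""
    return rng.sample(range(p), mtry)


rng = random.Random(0)
in_bag, oob = draw_tree_sample(100, 0.632, False, rng)
candidates = candidate_predictors(65, 5, rng)
```

With subsampling, every observation is either in-bag or OOB for a given tree, so the OOB observations of that tree form an independent test set for it.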

Both forest versions implement so-called variable
importance measures which can be used to get a ranking 
of the predictors according to their association with the 
response. In the following, we briefly introduce the 
standard permutation VIM as well as our novel permuta- 
tion VIM, which is based on the area under the curve. 

Random forest variable importance measures 

RF's variable importance measures are often used for
feature selection for high-dimensional data settings,
which makes them especially attractive for bioinformatics
and related fields, where identifying a subset of relevant 
predictors from a large set of candidate predictors is a 
major challenge (known as the "small n large p" prob- 
lem). The two standard VIMs for feature selection with 
RF are the Gini VIM and the permutation VIM. Roughly 
speaking the Gini VIM of a predictor of interest is the 
sum over the forest of the decreases of Gini impurity 
generated by this predictor whenever it was selected for 
splitting, scaled by the number of trees. This measure 
has been shown to prefer certain types of predictors 
[22-25]. The resulting predictor ranking should therefore 
be treated with caution. That is why in this paper we 
focus on the permutation VIM that gives essentially 
unbiased error rate rankings of the predictors. 

Error-rate-based permutation VIM 

From now on, we denote the standard permutation VIM 
as "error-rate-based permutation VIM", since it is based 
on the OOB error rate, as outlined below. More precisely, 
it measures the difference between the OOB error rate 
after and before permuting the values of the predictor 
of interest. The error-rate-based permutation variable 
importance (VI) for predictor j is defined by: 

$$VI_j^{(ER)} = \frac{1}{ntree} \sum_{t=1}^{ntree} \left( \widetilde{ER}_{tj} - ER_{tj} \right) \qquad (1)$$

where

• ntree denotes the number of trees in the forest,

• $ER_{tj}$ denotes the mean error rate over all OOB
observations in tree t before permuting predictor j,

• $\widetilde{ER}_{tj}$ denotes the mean error rate over all OOB
observations in tree t after randomly permuting
predictor j.

The idea underlying this VIM is the following: If the 
predictor is not associated with the response, the permu- 
tation of its values has no influence on the classification, 
and thus also no influence on the error rate. The error 
rate of the forest is not substantially affected by the per- 
mutation and the VI of the predictor takes a value close 
to zero, indicating no association between the predictor 
and the response. In contrast, if response and predictor 
are associated, the permutation of the predictor values 
destroys this association. "Knocking out" this predictor 
by permuting its values results in a worse classification 
leading to an increased error rate. The difference in 
error rates before and after randomly permuting the 
predictor thus takes a positive value reflecting the high 
importance of this predictor. 
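
The computation can be sketched as follows. This is a simplified illustration rather than the implementation in party or randomForest: the "trees" are fixed decision stumps instead of fitted trees, the OOB sets are drawn at random, and the toy data are invented for the example.

```python
import random


def error_rate(tree, X, y):
    """Share of observations that the tree misclassifies."""
    return sum(tree(x) != label for x, label in zip(X, y)) / len(y)


def permute_column(X, j, rng):
    """Copy X with the values of predictor j randomly permuted across rows."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    return [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]


def vi_error_rate(trees, oob_sets, X, y, j, rng):
    """Mean over trees of (OOB error after permuting j) - (OOB error before)."""
    total = 0.0
    for tree, oob in zip(trees, oob_sets):
        X_oob = [X[i] for i in oob]
        y_oob = [y[i] for i in oob]
        before = error_rate(tree, X_oob, y_oob)
        after = error_rate(tree, permute_column(X_oob, j, rng), y_oob)
        total += after - before
    return total / len(trees)


# Toy data: predictor 0 determines the class, predictor 1 is pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [int(row[0] > 0) for row in X]


# Stand-in "forest": ten identical stumps that split on predictor 0 only.
def stump(row):
    return int(row[0] > 0)


trees = [stump] * 10
oob_sets = [rng.sample(range(200), 60) for _ in trees]

vi_associated = vi_error_rate(trees, oob_sets, X, y, 0, rng)
vi_noise = vi_error_rate(trees, oob_sets, X, y, 1, rng)
```

Because the stumps never use predictor 1, permuting it leaves every prediction unchanged and its VI is exactly zero, whereas permuting predictor 0 increases the OOB error rate.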



A novel AUC-based permutation VIM 

Our new AUC-based permutation VIM is closely related 
to the error-rate-based permutation VIM. They only differ 
with respect to the prediction accuracy measure: In a 
nutshell, the error rate of a tree involved in (1) is replaced 
by the area under the curve (AUC) [28]. We define the 
AUC-based permutation VI for predictor j as: 

$$VI_j^{(AUC)} = \frac{1}{ntree^*} \sum_{t=1}^{ntree^*} \left( AUC_{tj} - \widetilde{AUC}_{tj} \right) \qquad (2)$$

where

• ntree* denotes the number of trees in the forest
whose OOB observations include observations from
both classes,

• $AUC_{tj}$ denotes the area under the curve computed
from the OOB observations in tree t before
permuting predictor j,

• $\widetilde{AUC}_{tj}$ denotes the area under the curve computed
from the OOB observations in tree t after randomly
permuting predictor j.

Instead of computing the error rate for each tree after 
and before permuting a predictor, the AUC is computed. 
The AUC for a tree is based on the so-called class prob- 
abilities, i.e. the estimated probability of each observa- 
tion to belong to the class Y = 0 or Y = 1, respectively. 
The class probabilities of an observation are determined
by the relative frequency of training observations belonging
to the corresponding class in the terminal node into
which the observation falls. If one considers an
OOB observation with Y = 0 and an OOB observation 
with Y = 1, a "good tree" is expected to assign a larger 
class probability for class Y = 1 to the observation truly 
belonging to class Y = 1 than to the observation belong- 
ing to class Y = 0. The AUC for a tree corresponds to 
the proportion of pairs for which this is the case. It can 
be seen as an estimator of the probability that a ran- 
domly chosen observation from class Y = 1 receives a 
higher class probability for class Y = 1 than a randomly 
chosen observation from class Y = 0. Note that with the 
use of the AUC, the information contained in the class
probabilities returned by a tree is adequately exploited.
This is not the case for the error rate, which requires a
dichotomization of class probabilities. From a practical 
point of view, the AUC is computed by making use of 
its equivalence with the Mann-Whitney-U statistic. The 
Mann-Whitney-U statistic is solely based on the rankings 
of two independent samples. AUC values of 1 correspond 
to a perfect tree classifier, since a perfect classifier would 
attribute each observation from one class a higher prob- 
ability to belong to this class than any observation from 
the other class. AUC values of 0.5 correspond to a useless 
tree classifier that randomly allocates class probabilities to
the observations. In this case, in about half of the cases a randomly
drawn observation from one class receives a higher
probability of belonging to that class than a randomly 
drawn observation from the other class. 
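
The pairwise interpretation of the AUC and its use in (2) can be sketched as follows. This is an illustrative sketch, not the code of the party package: the AUC is computed directly as the proportion of correctly ordered pairs (equivalent to the Mann-Whitney-U statistic divided by the number of pairs), ties are counted as one half (an assumption on our side; ties are frequent because all observations in a terminal node share the same class probability), and the input format of vi_auc is hypothetical.

```python
def auc_mann_whitney(scores, labels):
    """AUC as the share of (class 1, class 0) pairs in which the class 1
    observation has the higher score; ties count as one half."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC requires observations from both classes")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def vi_auc(per_tree):
    """Average decrease in OOB AUC over the trees whose OOB sample
    contains both classes (the ntree* trees in the definition).

    per_tree holds one tuple per tree:
    (class 1 probabilities before permutation,
     class 1 probabilities after permutation,
     true OOB labels).
    """
    diffs = []
    for before, after, labels in per_tree:
        if len(set(labels)) < 2:
            continue  # OOB sample has only one class: AUC undefined, tree skipped
        diffs.append(auc_mann_whitney(before, labels)
                     - auc_mann_whitney(after, labels))
    return sum(diffs) / len(diffs)


# Two hypothetical trees; the second has only class 0 in its OOB sample.
example = [
    ([0.9, 0.1], [0.5, 0.5], [1, 0]),
    ([0.3, 0.4], [0.3, 0.4], [0, 0]),
]
vi = vi_auc(example)
```

For large OOB samples a rank-based computation of the U statistic is more efficient than the double loop used here, but it returns the same value.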

The novel AUC-based permutation VIM is implemented 
in the package party for the unbiased RF variant based on 
conditional inference trees. Note that the discrepancy in 
performance between the standard permutation VIM 
and the AUC-based permutation VIM is transferable to 
the original version of RF since the VI ranking mechan- 
ism is completely independent from the construction of 
the trees. 

Comparison studies 

The behavior of the two introduced permutation VIMs is 
expected to be different in the presence of unbalanced 
data. The AUC is a prediction accuracy measure which 
puts the same weight on both classes independently of 
their sizes [26]. The error rate, in contrast, gives essentially 
more weight to the majority class because it does not take 
class affiliations into account and regards all misclassifica- 
tions equally important. In the results section we try to 
explain the consequences for the performance of the per- 
mutation VIMs for unbalanced data settings and provide 
evidence for our supposition. We performed studies on 
simulated and on real data to explore and contrast the 
performance of both permutation VIMs. Using simulated 
data we aim to see whether total sample size and effect 
size play a role for the class imbalance problem. We 
explored this by varying the total number of observations 
and by simulating predictors with different effect sizes. 
Furthermore we conducted analyses based on real data to 
provide additional evidence based on realistic data struc- 
tures which usually incorporate complex interdependen- 
cies. Our comparison studies on simulated and on real 
data were conducted using the unbiased RF variant based 
on conditional inference trees. The implementation of this 
unbiased RF variant is available in the R system for statistical
computing via the package party [29].

Simulated data 

The considered simulation design represents a scenario 
where the predictors associated with the response vari- 
able Y (binary) are to be identified from a set of continu- 
ous predictors. We performed simulations for varying 
imbalance levels: 50% corresponding to a completely 
balanced sample, 40%, 30%, 20%, 10%, 5% and 1% corre- 
sponding to different imbalance levels from slight to 
very extreme class imbalances. The simulation setting 
comprises both predictors not associated with the re- 
sponse and associated predictors with three different 
levels of effect sizes. Table 1 presents the data setting 
used throughout this simulation.

Table 1 Distribution of predictors in class 1 and class 2

Predictors                   Distribution in class 1   Distribution in class 2   Effect size
$X_1, \ldots, X_5$           N(1.00, 1)                N(0, 1)                   strong effect
$X_6, \ldots, X_{10}$        N(0.75, 1)                N(0, 1)                   moderate effect
$X_{11}, \ldots, X_{15}$     N(0.50, 1)                N(0, 1)                   weak effect
$X_{16}, \ldots, X_{65}$     N(0, 1)                   N(0, 1)                   no effect


The first five predictors $X_1, \ldots, X_5$ differ strongly between
classes with mean $\mu_1 = 1$ in one class and mean
$\mu_2 = 0$ in the other class. The predictors $X_6, \ldots, X_{10}$ have
a moderate mean difference between the two classes with
$\mu_1 = 0.75$ and $\mu_2 = 0$. For $X_{11}, \ldots, X_{15}$ there is only a small
difference between the classes with $\mu_1 = 0.5$ and $\mu_2 = 0$.
We simulated 50 additional predictors following a 
standard normal distribution with no association to the 
response variable (termed noise predictors). 
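
A dataset of this form could be generated along the following lines. This is a sketch of the design in Table 1 and not the original simulation code (which is available from the companion website); in particular, treating class 1 as the minority class is an assumption made here for illustration.

```python
import random

# Means in class 1 for the 65 predictors as in Table 1:
# 5 strong, 5 moderate, 5 weak and 50 noise predictors.
# In class 2 all predictors have mean 0; all variances are 1.
MEANS_CLASS1 = [1.0] * 5 + [0.75] * 5 + [0.5] * 5 + [0.0] * 50


def simulate(n, minority_fraction, rng):
    """Simulate n observations with the given share of class 1."""
    n1 = round(n * minority_fraction)
    y = [1] * n1 + [2] * (n - n1)
    X = [[rng.gauss(m if c == 1 else 0.0, 1.0) for m in MEANS_CLASS1]
         for c in y]
    return X, y


# Example: n = 500 with an imbalance level of 5%.
X, y = simulate(500, 0.05, random.Random(0))
```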

We performed analyses with varying sample sizes and 
report the results for total sample sizes of n = 100, n = 500 
and n = 1000. For each parameter combination, i.e. imbal- 
ance level and sample size, we simulated 100 datasets and 
computed AUC-based and error-rate-based permutation 
VIs for each dataset. Note that for a sample size of n = 100
an imbalance of 1% is not meaningful since there is only 
one observation in the minority class. 

Forest and tree parameters were held fixed. The par- 
ameter ntree denoting the number of trees in a forest 
was set to 1000, and the parameter mtry for the number of candidate
predictors per split was set to the default value of 5. We
used subsampling instead of bootstrap sampling for 
constructing the trees, i.e. setting the parameter replace 
to FALSE [22]. Conditional inference trees were grown 
to maximal possible depth, i.e. setting the parameters 
minsplit, minbucket and mincriterion in the cforest 
function to zero. 

Real data 

We also investigated the performance of the error-rate- 
based and the AUC-based permutation VIM on real 
data including complex dependencies (e.g. correlations) 
and predictors of different scales. The dataset is about 
RNA editing in land plants [30]. RNA editing is the 
modification of the RNA sequence from the corre- 
sponding DNA template. It occurs e.g. in plant mito- 
chondria where some cytidines are converted to 
uridines before translation (abbreviated with C-to-U 
conversion in the following). The dataset comprises a 
total of 43 predictors: 41 categorical predictors (40 nu- 
cleotides at positions -20 to 20 relative to the edited site 
and one predictor describing the codon position) and 
two continuous predictors (one for the estimated folding
energy and one predictor describing the difference
in estimated folding energy between pre-edited and
edited sequences). It includes 2694 observations, where 
exactly one half has an edited site and the other half has 
a non-edited site. The data are publicly available from 
the journal's homepage. After excluding observations 
with missing values, a total of 2613 observations were 
left, where 1307 had a non-edited site and 1306 obser- 
vations had an edited site. We used this balanced 
dataset to explore the performance of ER- and AUC- 
based permutation VIM for varying class imbalances - 
but now with realistic dependencies and predictors of 
different scales. For this purpose, we artificially created 
different imbalance levels by drawing random subsets 
from the class with edited sites. 

Application of the standard permutation VIM to the 
data using the 2613 observations without missing values 
gave VIs greater than zero for all 43 predictors for
different random seeds (i.e. different starting values for 
the random permutation), indicating that all predictors 
seem to have at least a small predictive power (data not 
shown). We generated and added additional predictors 
without any effect (termed noise predictors in the fol- 
lowing) in order to evaluate the performance of error- 
rate-based and AUC-based permutation VIMs. Provided 
that there is a higher association between the response 
and any of the original predictors than between the re- 
sponse and any of the simulated noise predictors, a well 
performing VIM would attribute a higher VI to original 
predictors than to simulated noise predictors. The noise 
predictors were generated by randomly permuting the 
values of the original predictors. Each original predictor 
was permuted once, resulting in a total of 43 noise 
predictors. The whole process consisting of (1) creating
43 noise predictors, (2) merging them with the original
dataset, (3) randomly subsampling to create an unbalanced
dataset and (4) computing the error-rate-based
and AUC-based permutation VIs, was repeated 100
times for each imbalance level to get stable results for 
the VIM performance. To check the assumption that 
there is a higher association between the response and 
any of the original predictors than between the response 
and any of the simulated predictors, we computed the 
mean VI over 100 completely balanced datasets that had 
been extended by noise predictors. Figure 1 shows that 
all mean VIs of the original predictors are higher than
any mean VI of a simulated noise predictor and hence 
confirms our first impression. 
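
Step (1) and step (2) of this process, the construction of noise predictors by permutation and their merging with the original data, can be sketched as follows. The function name and the toy rows are hypothetical and serve only to illustrate the idea.

```python
import random


def add_permuted_noise(X, rng):
    """Append one independently permuted copy of every predictor.

    Each noise predictor keeps the marginal distribution of its original
    predictor, but its association with the response and with the other
    predictors is broken by the permutation.
    """
    noise = []
    for column in zip(*X):
        shuffled = list(column)
        rng.shuffle(shuffled)
        noise.append(shuffled)
    return [list(row) + [col[i] for col in noise] for i, row in enumerate(X)]


# Hypothetical toy rows with one numeric and one categorical predictor.
X = [[1, "a"], [2, "c"], [3, "g"], [4, "u"]]
X_extended = add_permuted_noise(X, random.Random(0))
```

Applied to the 43 original predictors, this yields the 86 predictors of the extended dataset.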

Performance evaluation criteria 

VIMs give a ranking of the predictors according to their 
association with the response. To evaluate the quality of 
the rankings by the permutation VIMs the AUC was 
used as performance measure. The AUC was computed 
to assess the ability of a VIM to differentiate between 
associated predictors and predictors not associated with 
the response. AUC values of 1 mean that each associated 
predictor receives a higher VI than any noise predictor, 
thus indicating a perfect discrimination. AUC values of 
0.5 mean that a randomly drawn associated predictor 
receives a higher VI than a randomly drawn noise pre- 
dictor in only half of the cases, indicating no discrimina- 
tive ability. 
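
Applied to VIs, this criterion can be sketched as follows. The VI values in the example are invented, and counting ties as one half is an assumption made for this illustration.

```python
def ranking_auc(vi, is_associated):
    """Share of (associated, noise) predictor pairs in which the
    associated predictor receives the higher VI; ties count as one half."""
    assoc = [v for v, a in zip(vi, is_associated) if a]
    noise = [v for v, a in zip(vi, is_associated) if not a]
    wins = sum((a > b) + 0.5 * (a == b) for a in assoc for b in noise)
    return wins / (len(assoc) * len(noise))


# Hypothetical VIs of two associated and two noise predictors.
perfect = ranking_auc([0.03, 0.02, 0.0, -0.001], [True, True, False, False])
imperfect = ranking_auc([0.03, 0.0, 0.01, -0.001], [True, True, False, False])
```

In the second example one of the four pairs is ordered incorrectly, so the AUC drops from 1 to 0.75.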

Figure 1 Mean VIs for the 43 original predictors and 43 noise predictors from the balanced modified C-to-U conversion dataset. Mean
VIs were obtained by averaging the VIs (computed with the commonly used error-rate-based permutation VIM) over 100 extended versions of the C-to-U
conversion dataset.

For our comparison studies we defined the two classes
which are to be differentiated by a VIM in the following
way. In the first instance of our studies on simulated data,
all predictors which are associated with the response
formed one class and noise predictors formed the other class.
In more detailed subsequent analyses we then explored 
the ability of the VIMs to discriminate between predictors 
with the same effect size and predictors without an effect. 
For this analysis one class comprised the noise predictors 
while the other class comprised only predictors with the 
same effect. For the studies on real data it was not possible 
to conduct such detailed analyses because the true order- 
ing of the predictors according to their association with 
the response is not known. Hence in the analysis on real 
data we restricted our analysis to the discrimination be- 
tween original predictors forming one class and simulated 
noise predictors forming the other class. 

Results and discussion 

Why may the error-rate-based permutation VIM fail in 
case of class imbalance? 

The prioritisation of the majority class in unbalanced data 
settings is well known in the context of RF classification 
and can easily be seen from trees constructed on unbal- 
anced data. Trees trained on unbalanced data more often 
predict the majority class, which leads to the minimization 
of the overall error rate. But how does this affect the 
performance of the permutation VIMs? And why is the 
AUC-based permutation VIM expected to be more robust 
towards class imbalance than the commonly used error- 
rate-based permutation VIM? 

To answer these questions we consider an extremely un- 
balanced data setting and illustrate what happens in a tree 
when permuting the values of an associated predictor. We 
will first have a look at observations from the majority 
class. For this class nearly all observations are correctly 
classified by a tree which has been trained on extremely 
unbalanced data. If we now permute the values of an associated
predictor, this generally does not result in a classification
into the minority class since a classification into the
minority class is an unlikely event - even for an observa- 
tion from this class. A very specific data pattern is required 
for an observation to be classified into the minority class. 
It is unlikely that a random permutation of an associated 
predictor results in such a specific data pattern just by 
chance. Thus, for the majority class we expect hardly any 
observation to be incorrectly classified to the minority
class after the permutation of an associated predictor. 
Thus the error rate does not considerably increase after 
the permutation of an associated predictor, finally leading 
to a rather low contribution to the VI. 

Now let us consider the classifications by a tree for ob- 
servations from the minority class. For an extreme class 
imbalance most of the observations from the minority 
class are falsely classified to the majority class due to the 
above described focus on the majority class. It might be 
the case that some observations from the minority class 
are correctly classified by the tree because these
observations have that specific pattern of predictor
values which is required for an observation to be classi- 
fied into the minority class. It is likely that a permuta- 
tion of the values of an associated predictor might then 
destroy that specific pattern so that after the permuta- 
tion, these observations are not identified anymore to 
be in the minority class. Thus a misclassification due to 
the elimination of an associated predictor is much more 
likely to appear in observations from the minority class 
than in observations from the majority class. Note that 
only a small number of observations from the minority 
class are affected since most of the observations from 
the minority class are classified into the majority class 
anyway (before as well as after the permutation). The 
change in error rates is thus expected to be rather small - 
albeit it is more pronounced than the change in error rates 
in the majority class. 

Note that the error-rate-based permutation VIM does 
not take class affiliations into account. Thus the change 
in error rates is actually not computed separately for 
each class. Yet, in order to better understand the behav- 
ior of the VIM, it may help to point out that if the class 
proportions were the same in all OOB samples, the VI 
of a predictor could be directly derived as the weighted 
average of the class specific differences in the error rates. 
The weights would correspond to the proportion of obser- 
vations from the respective class. In practice the class 
frequencies will not be equal in all OOB samples, but the 
concept of a weighted average of the class specific error 
rates illustrates the fact that for unbalanced data settings 
the VI is mainly driven by the change in error rates 
derived from observations from the majority class. Since 
the change in error rates in the majority class is expected 
to be much smaller compared to the change in error rates 
in the minority class, the computed VIs are rather low.
This results in low VIs even for associated predictors and
in a poor differentiation of associated predictors and 
predictors not associated with the response. 
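
The weighted-average argument can be made concrete with a small numerical sketch. The class specific changes in error rate used here are hypothetical values chosen for illustration and are not results of this study.

```python
def weighted_vi(delta_minority, delta_majority, minority_share):
    """VI as the weighted average of the class specific changes in error
    rate, with weights equal to the class proportions (this assumes the
    same class proportions in all OOB samples)."""
    return (minority_share * delta_minority
            + (1 - minority_share) * delta_majority)


# Hypothetical changes in error rate after permuting an associated
# predictor: 0.10 in the minority class, 0.002 in the majority class.
vi_unbalanced = weighted_vi(0.10, 0.002, 0.05)  # 5% minority class
vi_balanced = weighted_vi(0.10, 0.002, 0.50)    # same changes, balanced weights
```

With a minority share of 5%, the sizeable change in the minority class enters the VI with a weight of only 0.05, so the resulting VI (0.0069) stays close to the small change observed in the majority class.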

Class specific VIs

Figure 2 VIs computed only from OOB observations of the minority class (top), from OOB observations of the majority class (middle)
and from all OOB observations (bottom). The first 15 predictors are associated with the response while the remaining predictors are noise
predictors. VIs are shown for a total sample size of n = 500 and an imbalance level of 5%.

This theory is supported by computing class specific VIs
(corresponding to mean changes in error rates computed
only from observations belonging to the same class).
Class specific VIs were computed using the R
package randomForest implementing the standard RF
algorithm. The importance function of this package
provides permutation VIs computed separately for each
class (besides the VIs by the standard permutation VIM
and by the Gini VIM). The class specific VIs for a total
sample size of n = 500 and an imbalance level of 5% are
shown in Figure 2, where predictors $X_1$ to $X_{15}$ have an
effect while the remaining 50 predictors do not have an
effect, corresponding to the simulation setting previously
described in Table 1 in the context of the comparison
study (for simplicity, we use the same setting as in the
comparison study, although the addressed problem is here 
a different one). Different sample sizes and imbalance 
levels give similar results (thus not shown). They confirm 
our argumentation that the change in the error rates com- 
puted from OOB observations from the majority class is 
smaller than the change in error rates computed from 
OOB observations from the minority class. This results in 
an underestimation of the actual permutation VI due to a 
much higher weighting of the majority class in the computation
of the VI (see concordance of VIs in middle and
lower panel of Figure 2). The discrepancy between the VIs
computed from observations of the minority class and VIs
computed from observations of the majority class depends 
on the class imbalance and is more pronounced for more 
extreme class imbalances. 

This motivates the use of an alternative accuracy 
measure which better incorporates the minority class. 
While the error rate gives the same weight to all obser- 
vations, therefore focusing more on the majority class, 
the AUC is a measure which does not prefer one class 
over the other but instead puts exactly the same weight 
on both classes. Therefore the AUC-based permutation 
VIM is expected to detect changes in tree predictions
for observations from the minority class, which might
not be grasped by the error-rate-based permutation VIM
due to a much higher weighting of the majority class.
The VIs for associated predictors obtained by the AUC-based
permutation VIM are thus expected to be comparatively
higher than the VIs obtained by the error-rate-based
permutation VIM. This would result in a better differenti- 
ation of associated and noise predictors by the AUC-based 
permutation VIM. These conjectures are assessed in the 
comparison study presented in the next section. (An add- 
itional performance comparison between the AUC-based 
permutation VIM and the error-rate-based permutation 
VIM based only on observations from the minority class 
is documented in Additional file 1.) 
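
The conjecture can be illustrated with a toy sketch for a single tree. It is not the implementation in the party package. The scores below are hypothetical predicted probabilities for class 1 (the minority class) on ten OOB observations, the 0.5 cut-off for class prediction is an assumption, and the AUC is computed as the share of (class 1, class 0) pairs that are ranked correctly, with ties counted as one half.

```python
def auc(scores, labels):
    """AUC as the share of (class 1, class 0) pairs in which the class 1
    observation receives the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def error_rate(scores, labels, threshold=0.5):
    """Share of misclassified observations when class 1 is predicted
    for scores above the (assumed) threshold."""
    predicted = [1 if s > threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(predicted, labels)) / len(labels)


# Hypothetical OOB observations: 2 from the minority class, 8 from the majority class.
labels = [1, 1] + [0] * 8

# Hypothetical class 1 scores of one tree before and after permuting a predictor.
scores_before = [0.4, 0.3] + [0.1] * 8
scores_after = [0.1, 0.1, 0.4, 0.3] + [0.1] * 6

# Both differences are oriented so that a positive value indicates importance.
error_rate_change = error_rate(scores_after, labels) - error_rate(scores_before, labels)
auc_change = auc(scores_before, labels) - auc(scores_after, labels)

print(f"change in error rate: {error_rate_change:.3f}")
print(f"change in AUC:        {auc_change:.3f}")
```

In this constructed case the tree predicts the majority class for every observation both before and after permutation, so the error rate does not change at all, while the AUC drops from 1 to 0.375 because the ranking of the minority observations is destroyed.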

Comparison study with simulated data 

The performance of the error-rate-based and AUC- 
based VIMs as measured by the AUC is shown in 
Figure 3 for the three different total sample sizes with 
n = 100 (left panel), n = 500 (middle panel) and n = 
1000 observations (right panel) and different class imbal- 
ance levels. Filled boxes correspond to the AUC-based 
permutation VIM and unfilled boxes correspond to the 
error-rate-based permutation VIM. Figure 3 shows that
the performance of both VIMs decreases with
increasing class imbalance for all sample sizes. Note that
the decrease in performance for both VIMs is not solely
attributable to the imbalance ratio per se but also to the
reduced number of observations in the minority class
with increasing class imbalance. This is induced by
the simulation setting, since we held the total number of
observations fixed and varied the number of observations
in both classes to create different class imbalances.
If there are only a few observations in one class, the
tree predictions are less accurate. However, the performance
of the AUC-based permutation VIM decreases less
dramatically than the performance of the error-rate-based
permutation VIM. The discrepancy in performance
between the VIMs increases with increasing imbalance
level and is maximal for the most extreme class
imbalance. While for a sample size of n = 500 the
error-rate-based permutation VIM is no longer able to
discriminate between associated and noise predictors (AUC
values vary randomly around 0.5) for the most extreme
class imbalance of 1%, the AUC-based permutation VIM
still is, showing that it can be used to identify associated
predictors even if the minority class comprises only a few
observations. Chance can be ruled out as the explanation for
the better performance of the AUC-based permutation VIM,
since the distributions of AUC values differ significantly.
Furthermore, this difference in performance
between the two VIMs becomes even larger for
larger sample sizes.

Figure 3 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs
for different class imbalances. The AUC is used to assess the ability of a VIM to discriminate between predictors with an effect and predictors
without an effect. Distributions are shown for total sample sizes of n = 100 (left panel), n = 500 (middle panel) and n = 1000 (right panel).
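
Because the total sample size is held fixed, the imbalance level directly determines how many minority-class observations remain. The short sketch below tabulates these counts. The sample sizes are those of Figure 3, whereas the list of imbalance levels only collects values mentioned in this section and is not necessarily the full grid used in the study.

```python
def minority_class_size(n_total, imbalance_percent):
    """Number of minority-class observations when n_total is held fixed."""
    return n_total * imbalance_percent // 100


sample_sizes = (100, 500, 1000)
# Imbalance levels in percent (assumed selection, 50 = balanced classes).
imbalance_levels = (50, 20, 10, 5, 1)

sizes = {
    n: [minority_class_size(n, p) for p in imbalance_levels]
    for n in sample_sizes
}

for n, counts in sizes.items():
    row = ", ".join(f"{p}%: {c}" for p, c in zip(imbalance_levels, counts))
    print(f"n = {n:4d} -> {row}")
```

For n = 100 and an imbalance level of 1% a single minority-class observation remains, which illustrates why part of the performance loss is due to the small minority class rather than to the imbalance ratio itself.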

In a nutshell, in this first simulation the AUC-based
permutation VIM performed better in the case of class
imbalance. The following subsections focus on the
influence of sample size and effect size on the respective
performance of both permutation VIMs in unbalanced
data settings.



Influence of sample size 

In Figure 3, the performance of both VIMs improves 
with an increased total sample size for a fixed imbalance 
level since an increase in the sample size results in more 
accurate tree predictions. The right panel of Figure 3 
shows that both permutation VIMs are hardly affected 
by class imbalances up to 10% when the sample size is 
rather large (n = 1000). If the sample size is smaller
(n = 100), however, the performance of the VIMs is
considerably decreased for a 10% imbalance level. A
decrease in performance for a 10% imbalance level is also
observed for a sample size of n = 500, especially for the
error-rate-based permutation VIM. In summary, class
imbalance seems to be more problematic for the
permutation VIMs if the total sample size is small.

Influence of effect size 

We now explore the ability of the permutation VIMs to
identify predictors with different effect sizes in the presence
of unbalanced data. The AUC was again used as an
evaluation criterion to compare the ability of the
AUC-based and error-rate-based permutation VIMs to
discriminate between associated and non-associated predictors.
Here the evaluation was done for each effect size separately,
meaning that one class comprised all the noise predictors
while the other class comprised only predictors with the
considered effect size (either strong, moderate or weak).
Figure 4 shows the results for the setting with n = 100. The
results for other sample sizes are shown in Additional file 2.
The left panel of Figure 4 shows the performance of both
permutation VIMs according to their ability to discriminate
between predictors with weak effects and predictors
without an effect. The middle panel corresponds to the
AUC values for predictors with a moderate effect versus
noise predictors and the right panel corresponds to the
AUC values for predictors with a strong effect versus
noise predictors.

Figure 4 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs
for different class imbalances. The AUC is used to assess the ability of a VIM to discriminate between noise predictors and predictors with a
weak (left panel), moderate (middle panel) and strong (right panel) effect. Distributions are shown for a total sample size of n = 100.
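
The evaluation per effect size can be sketched as follows. The VI values of the predictors serve as scores, the predictors of one effect size form one class and all noise predictors form the other class, and the AUC measures how well the VIs separate the two. All VI values below are invented for illustration and the helper names are hypothetical.

```python
def auc(scores_associated, scores_noise):
    """Share of (associated, noise) predictor pairs in which the associated
    predictor receives the higher VI; ties count as one half."""
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in scores_associated
        for b in scores_noise
    )
    return wins / (len(scores_associated) * len(scores_noise))


# Hypothetical VIs from one dataset, grouped by the true effect size.
vi_by_effect = {
    "strong": [0.030, 0.025],
    "moderate": [0.012, 0.004],
    "weak": [0.003, -0.001],
}
vi_noise = [0.002, 0.000, -0.002, 0.005]

# One AUC per effect size: predictors of that effect size versus all noise predictors.
auc_by_effect = {
    effect: auc(values, vi_noise) for effect, values in vi_by_effect.items()
}

for effect, value in auc_by_effect.items():
    print(f"{effect:8s} vs noise: AUC = {value:.3f}")
```

An AUC of 1 means that every predictor of the considered effect size receives a higher VI than every noise predictor, while an AUC of 0.5 means that the VIs do not separate the two groups better than chance.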

Unsurprisingly, for both permutation VIMs predictors
having only a weak effect are less discriminable from
noise predictors than predictors with stronger effects.
For imbalances up to 20% both VIMs identify nearly all
predictors with a strong effect. Evidently there are
unbalanced data settings where the standard permutation
VIM still perfectly separates noise predictors
from predictors with pronounced effects. We conclude
that class imbalance is more problematic if predictors
with weak effects are to be identified while it plays a
minor role if the classes are well separable.

Comparison study with real data 

Figure 5 shows the distribution of AUC values for 100 
modified C-to-U conversion datasets for varying
imbalance levels. For the balanced dataset and for slight
class imbalances up to 40% both VIMs have a perfect
discriminative ability since all associated predictors
receive a higher VI than any noise predictor. Overall the
performance of both VIMs decreases with
increasing class imbalance. Note that the decreasing
performance for increasing class imbalances might be partly
attributable to the reduced total sample size, as the
class imbalance was created by randomly subsampling
observations from the class with the edited sites. When
comparing both VIMs, the AUC-based permutation
VIM significantly outperformed the standard permutation
VIM. For an imbalance of 30% the AUC-based
permutation VIM clearly identified more associated
predictors than the error-rate-based permutation VIM.
The superiority of the AUC-based permutation VIM
over the standard permutation VIM increased with
increasing class imbalance. For imbalances between
15% and 5% the discrepancy between the performance
of the AUC-based and the standard permutation VIM
was maximal.

Figure 5 Distribution of AUC-values for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class
imbalances derived from 100 modified datasets from C-to-U conversion data. The AUC is used to assess the ability of a VIM to discriminate
between associated predictors and predictors not associated with the response.
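
The subsampling step and its effect on the total sample size can be sketched as follows. The class sizes of 400 observations each are hypothetical and do not refer to the actual C-to-U conversion data, the function name is made up, and the sketch assumes that the other class is kept in full while only the class with the edited sites is subsampled.

```python
import random


def subsample_minority(pool, n_other, target_share, seed=0):
    """Randomly keep as many observations from `pool` as are needed for them
    to make up roughly `target_share` of the resulting dataset."""
    n_keep = round(target_share * n_other / (1.0 - target_share))
    if n_keep > len(pool):
        raise ValueError("pool too small for the requested share")
    return random.Random(seed).sample(pool, n_keep)


# Hypothetical balanced starting point: 400 observations per class.
edited_pool = list(range(400))  # indices of observations with edited sites
n_other = 400                   # size of the class that is kept in full

kept = subsample_minority(edited_pool, n_other, target_share=0.05)
n_total = len(kept) + n_other
share = len(kept) / n_total

print(f"kept {len(kept)} of {len(edited_pool)} observations from the subsampled class")
print(f"total sample size: {n_total}, minority share: {share:.3f}")
```

Under these assumptions the 5% imbalance level is reached by keeping 21 observations, and the total sample size falls from 800 to 421, which is the reduction in sample size that may partly explain the decreasing performance.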

Overall, this study on real data shows that
the AUC-based permutation VIM also works for complex
real data and outperforms the standard permutation VIM
in almost all class imbalance settings.

Conclusions 

The problem of unbalanced data has been widely discussed 
in the literature for diverse classifiers including random 
forests. Many approaches have been developed to improve 
the predictive ability of RF classifiers for unbalanced data 
settings. However less attention has been paid to the be- 
haviour of random forests' variable importance measures 
for unbalanced data. In this paper we explored the 
performance of the permutation VIM for different class 
imbalances and proposed an alternative permutation VIM 
which is based on the AUC. 

Our studies on simulated as well as on real data show 
that the commonly used error-rate-based permutation 
VIM loses its ability to discriminate between associated 
predictors and predictors not associated with the re- 
sponse for increasing class imbalances. This is particu- 
larly crucial for small sample sizes and if predictors with 
weak effects are to be detected. The decreasing performance
of the standard permutation VIM results from two
sources: the class imbalance at the training data level,
leading to trees more often predicting the majority class,
and the class imbalance at the OOB data level, leading to
blurred VIs due to a much higher weighting of error rate
differences in the majority class. A higher weighting of
the majority class in the VI calculation is problematic
because the difference in error rates was shown to be less
pronounced in the majority class than in the minority
class. Note that in some cases it might be interesting to
assess the increase in error rate obtained when a certain
predictor is removed. In this case the error-rate-based
permutation VIM can be considered. If the goal is to
rank the predictors according to their discrimination
power, however, the AUC-based permutation VIM should
be preferred.

The problem of imbalance at the OOB data level is 
directly addressed with the use of a novel AUC-based 
permutation VIM. This VIM puts the same weight on 
both classes by measuring the difference in AUCs in- 
stead of the difference in error rates. It is thus able to 
detect changes in tree predictions when permuting asso- 
ciated predictors which might not be grasped by the 
standard permutation VIM. In contrast, the imbalance
at the training data level is not addressed by the
AUC-based permutation VIM, meaning that the structure of a
tree remains untouched. On the one hand this is a drawback,
since class predictions before and after permuting a
predictor are similar even if the respective predictor is
associated with the response, resulting in a reduced
change in the AUCs. On the other hand, preserving the
tree structure can be regarded as an advantage, since a
change in tree structure might give rise to new, unexpected
behaviours. It is a major advantage of our novel
AUC-based permutation VIM that it is based on exactly
the same principle as the standard permutation VIM and
differs from it only with respect to the accuracy measure.
It is thus expected to share the advantages of the
standard permutation VIM and its properties and behaviours
discovered in recent years (e.g. its behaviour in the
presence of correlated predictors [31] and in the presence
of predictors with different scales [22] and category
sizes [24,25]).

Our studies on simulated as well as on real data show
that the AUC-based permutation VIM outperforms the
commonly used error-rate-based permutation VIM as well
as the error-rate-based permutation VIM computed only
from observations of the minority class in unbalanced
data settings (see Additional file 1 for the comparison
to the class-specific VIM). The difference in performance
between our novel AUC-based permutation VIM and the
standard permutation VIM can be substantial, especially for
extremely unbalanced data settings. But even for slight class
imbalances the AUC-based permutation VIM was shown to
be superior to the standard permutation VIM. We conclude
from our studies that the AUC-based permutation
VIM should be preferred to the standard permutation VIM
whenever the two response classes have different class sizes
and the aim is to identify relevant predictors.

Availability and requirements 

The AUC-based permutation VIM is implemented in
the new version of the party package for the freely
available statistical software R (http://www.r-project.org
and http://cran.r-project.org/web/packages/party/index.html).
It can be applied via the function varimpAUC.

All code implementing our studies on simulated and on
real data is available at
http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
for reproducibility purposes.

Additional files 



Additional file 1: This file shows the results of the performance 
comparison between the AUC-based permutation VIM and the 
error-rate-based permutation VIM computed using only 
observations from the minority class. 

Additional file 2: This file shows the distribution of AUC-values
(analogous to Figure 4) for sample sizes n = 500 and n = 1000.